Project Objective¶

As a Data Analyst and a baseball fan, I wanted to explore the data surrounding the attendance for my local minor league baseball team, the Ogden Raptors. Specifically, I wanted to focus on two data analysis topics:

  • Web Scraping

  • Data Visualization

In [1]:
#import packages
import pandas as pd
from os import path
from bs4 import BeautifulSoup as Soup
import requests
import numpy as np
from pandas import DataFrame
import plotly.express as px
import plotly.graph_objects as go

Raptors Data Scrape¶

The first step in this project was scraping that Raptors' schedule and results from their 2022 schedule site. The steps to do this are in the code below.

In [2]:
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
att_response = requests.get('https://ogden-raptors.com/sports/bsb/2022/teams/ogdenraptors?view=attendance', headers=HEADERS)

att_soup = Soup(att_response.text)
tables = att_soup.find_all('table')

att_table = tables[15]
att_rows = att_table.find_all('tr')

def parse_att_row(row):
    """
    Take in a tr tag and a tag and get the data out of it in the form of a list of
    strings.
    """
    return [str(x.string) for x in row.find_all(('td','a'))]

list_of_parsed_att_rows = [parse_att_row(row) for row in att_rows[1:]]
att_df = DataFrame(list_of_parsed_att_rows)

Data Cleaning¶

Now that we have the dataframe, the next step is cleaning the data and creating new columns to be used for this project and potential future projects.

In [3]:
# Relevant cols
att_df = att_df[[0,1,3,4]]
# Remove empty space befire att
att_df[4] = att_df[4].str.strip()
#remove empty space before result
att_df[3] = att_df[3].str.strip()
#remove unplayed games (either haven't occurred yet or were cancelled)
att_df = att_df[att_df[3] != "-"]
#Create location col
att_df['location'] = np.where(att_df[1].str.contains('at', na  = False), 'Away', 'Home')
att_df['location'] = np.where(att_df[1].str.contains('vs.', na  = False), 'Away', att_df['location'])
# Clean up opponent col
att_df[1] = att_df[1].str.replace('at\r\n', '')
att_df[1] = att_df[1].str.replace('vs.\r\n', '')
att_df[1] = att_df[1].str.strip()
# Name Cols
att_df.columns = ['Date', 'Opponent', 'Score', 'Attendance', 'Home_Away']
# Add Result
att_df['Result'] = att_df['Score'].str[0]
# Clean Score
att_df['Score'] = att_df['Score'].str.replace('W,', '')
att_df['Score'] = att_df['Score'].str.replace('L,', '')
att_df['Score'] = att_df['Score'].str.strip()
#Win Loss constants
att_df['win_constant'] = np.where(att_df['Result'] == "W",1,0)
att_df['loss_constant'] = np.where(att_df['Result'] == "L",1,0)
#win total
att_df['total_wins'] = att_df['win_constant'].cumsum()
#pregame win total
att_df['pregame_total_wins'] = np.where(att_df['Result'] == "W", att_df['total_wins'] - att_df['win_constant'], att_df['total_wins'])
#loss total
att_df['total_losses'] = att_df['loss_constant'].cumsum()
#pregame loss total
att_df['pregame_total_losses'] = np.where(att_df['Result'] == "L", att_df['total_losses'] - att_df['loss_constant'], att_df['total_losses'])
#game number
att_df['game'] = np.arange(len(att_df))+1
#pregame win %
att_df['pregame_win_%'] = att_df['pregame_total_wins'] / (att_df['game'] - 1)
#Score modification
att_df['Score'] = att_df['Score'].str.replace('-', '|')
#Date Type
att_df['Date'] = pd.to_datetime(att_df['Date'] + ' 2022', infer_datetime_format=True)
#Day of week col
att_df['Day_of_Week'] = att_df['Date'].dt.day_name()
#Attendance Clean
att_df['Attendance'] = att_df['Attendance'].str.replace(',', '')
att_df['Attendance'] = att_df['Attendance'].astype(str).astype(int)
#day_classifier
weekend_list = ['Friday', 'Saturday']
pattern = '|'.join(weekend_list)
att_df['day_classifer'] = np.where(att_df['Day_of_Week'].str.contains(pattern),'weekend','weekday')
#add datestring
att_df['Date_string'] = att_df['Date'].dt.strftime('%Y-%m-%d')
att_df['Date_string'][68] = att_df['Date_string'][68] + ' Game1'
att_df['Date_string'][69] = att_df['Date_string'][69] + ' Game2'
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\2924845361.py:14: FutureWarning: The default value of regex will change from True to False in a future version.
  att_df[1] = att_df[1].str.replace('vs.\r\n', '')
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\2924845361.py:54: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  att_df['Date_string'][68] = att_df['Date_string'][68] + ' Game1'
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\2924845361.py:55: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  att_df['Date_string'][69] = att_df['Date_string'][69] + ' Game2'

Data Preview¶

In [4]:
att_df[:5]
Out[4]:
Date Opponent Score Attendance Home_Away Result win_constant loss_constant total_wins pregame_total_wins total_losses pregame_total_losses game pregame_win_% Day_of_Week day_classifer Date_string
0 2022-05-25 Grand Junction Jackalopes 12|7 3127 Home W 1 0 1 0 0 0 1 NaN Wednesday weekday 2022-05-25
1 2022-05-26 Grand Junction Jackalopes 12|7 3940 Home L 0 1 1 1 1 0 2 1.000000 Thursday weekday 2022-05-26
2 2022-05-27 Grand Junction Jackalopes 10|7 3687 Home L 0 1 1 1 2 1 3 0.500000 Friday weekend 2022-05-27
3 2022-05-28 Grand Junction Jackalopes 10|4 2575 Home W 1 0 2 1 2 2 4 0.333333 Saturday weekend 2022-05-28
4 2022-05-31 Boise Hawks 9|6 2804 Away W 1 0 3 2 2 2 5 0.500000 Tuesday weekday 2022-05-31

Filter for Relevant Data¶

For the scope of this project, I'm only choosing to look at data from the Raptors' home games.

In [5]:
home_df = att_df[(att_df['Home_Away'] == 'Home') &( att_df['Attendance'] >0)].reset_index()
home_df = home_df.drop(['index'], axis=1)
home_df['game'] = np.arange(len(home_df))+1

Explore and Visualize¶

First let's simply look at the attendance from each of the home Raptors games.

In [6]:
#Explore
fig = px.bar(home_df, x='game', y='Attendance', hover_name = 'Date_string')
fig.update_xaxes(
    tickvals = [*range(1,len(home_df),3)],
    ticktext = home_df['Date_string'][[*range(0,len(home_df),3)]],
    tickangle = 45  
)
fig.show()

Is there a correlation between 'attendance' and 'day of week'?¶

In [7]:
day_of_week_df = home_df.groupby('Day_of_Week')\
.agg({'Date' : 'count', 'Attendance' : 'mean'})\
.rename(columns = {'Date' : 'Count of Games', 'Attendance' : 'Average Attendance'})\
.reset_index()
day_of_week_df['day_of_week_num'] = np.where(day_of_week_df['Day_of_Week'] == 'Sunday', 1,
                                            np.where(day_of_week_df['Day_of_Week'] == 'Monday', 2,
                                                    np.where(day_of_week_df['Day_of_Week'] == 'Tuesday', 3,
                                                            np.where(day_of_week_df['Day_of_Week'] == 'Wednesday', 4,
                                                                    np.where(day_of_week_df['Day_of_Week'] == 'Thursday', 5,
                                                                            np.where(day_of_week_df['Day_of_Week'] == 'Friday', 6,7))))))
day_of_week_df.sort_values(by=['day_of_week_num'], ignore_index=True)
Out[7]:
Day_of_Week Count of Games Average Attendance day_of_week_num
0 Sunday 5 2048.400000 1
1 Monday 1 1873.000000 2
2 Tuesday 6 3587.166667 3
3 Wednesday 7 3303.142857 4
4 Thursday 9 4328.222222 5
5 Friday 9 4334.444444 6
6 Saturday 7 4010.285714 7

It appears Thursday, Friday, Saturday are the most attended Raptors games (based on average attendance)¶

In [8]:
fig = px.bar(day_of_week_df, x='Day_of_Week', y='Average Attendance', hover_data= ['Count of Games'], category_orders = {'Day_of_Week' : ['Sunday', 'Monday', 'Tuesday', 'Wednesday','Thursday','Friday','Saturday']})
fig

Weather Scrape¶

I'd like to see if the local Ogden weather on the day of a Raptors game has any correlation with attendance. To do this, I need to scrape weather data from almanac.com. After creating the function to scrape the Almanac weather website, I need to run it just for the dates of the Raptors home games. The steps to do this are in the code below.

In [9]:
def parse_row2(row):
    """
    Take in tr tag and get data in list of strings.
    """
    return [str(x.string) for x in row]

def scrape_weather_day(day):
    weather_response = requests.get(f'https://www.almanac.com/weather/history/zipcode/84404/{day}', headers = HEADERS)
    weather_soup = Soup(weather_response.text)
    weather_tables = weather_soup.find_all('table')
    weather_table = weather_tables[0]
    weather_ps = weather_table.find_all('p')
    weather_parsed_rows = [parse_row2(row) for row in weather_ps]
    weather_df = DataFrame(weather_parsed_rows)
    weather_df = weather_df[0].to_frame().T
    weather_df = weather_df[[0,1,2,6,10,11]]
    weather_df.columns = ['min_temp','mean_temp','max_temp','total_precipitation','mean_wind_speed','max_sustained_wind_speed']
    weather_df['day'] = day
    return weather_df
In [10]:
# run the weather function for the Raptors game dates
rap_game_list = home_df['Date'].dt.strftime('%Y-%m-%d').unique().tolist()
raptor_weather = pd.concat([scrape_weather_day(x) for x in rap_game_list]).reset_index(drop = True)
raptor_weather['day'] = pd.to_datetime(raptor_weather["day"])
raptor_weather.head()
Out[10]:
min_temp mean_temp max_temp total_precipitation mean_wind_speed max_sustained_wind_speed day
0 45.0 62.5 79.0 0.00 7.02 13.81 2022-05-25
1 46.9 72.2 91.9 0.00 6.10 11.39 2022-05-26
2 55.0 77.9 91.9 0.00 7.83 14.96 2022-05-27
3 53.1 65.3 86.0 0.00 9.67 20.83 2022-05-28
4 51.1 67.0 80.1 0.00 4.49 10.24 2022-06-07

Join weather data w/ Raptors Game data¶

As you can see in the data below (far right columns) we can now see weather data (temp, precip, wind) for each game day in Ogden, UT.

In [11]:
home_df = pd.merge(home_df, raptor_weather, left_on = 'Date', right_on = 'day', how = 'left').drop('day', 1)
home_df['max_temp'] = pd.to_numeric(home_df['max_temp'])
home_df['mean_temp'] = pd.to_numeric(home_df['mean_temp'])
home_df['min_temp'] = pd.to_numeric(home_df['min_temp'])
home_df.head()
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\39735578.py:1: FutureWarning:

In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.

Out[11]:
Date Opponent Score Attendance Home_Away Result win_constant loss_constant total_wins pregame_total_wins ... pregame_win_% Day_of_Week day_classifer Date_string min_temp mean_temp max_temp total_precipitation mean_wind_speed max_sustained_wind_speed
0 2022-05-25 Grand Junction Jackalopes 12|7 3127 Home W 1 0 1 0 ... NaN Wednesday weekday 2022-05-25 45.0 62.5 79.0 0.00 7.02 13.81
1 2022-05-26 Grand Junction Jackalopes 12|7 3940 Home L 0 1 1 1 ... 1.000000 Thursday weekday 2022-05-26 46.9 72.2 91.9 0.00 6.10 11.39
2 2022-05-27 Grand Junction Jackalopes 10|7 3687 Home L 0 1 1 1 ... 0.500000 Friday weekend 2022-05-27 55.0 77.9 91.9 0.00 7.83 14.96
3 2022-05-28 Grand Junction Jackalopes 10|4 2575 Home W 1 0 2 1 ... 0.333333 Saturday weekend 2022-05-28 53.1 65.3 86.0 0.00 9.67 20.83
4 2022-06-07 Northern Colorado Owlz 5|2 3970 Home W 1 0 8 7 ... 0.700000 Tuesday weekday 2022-06-07 51.1 67.0 80.1 0.00 4.49 10.24

5 rows × 23 columns

Is there a correlation between 'attendance' and 'max temperature'?¶

In [12]:
fig = px.scatter(home_df, x="max_temp", y="Attendance"\
                 #, color = "day_classifer", 
                 ,trendline="ols"\
                )
fig.show()

Overall, there is very little correlation between the max temperature and attendance of home Raptors games.¶

Conclusion / Next Steps¶

Because the Raptors only played 44 home games in 2022, it is difficult to see meaningful attendance trends as it relates to 'day of week' and 'temperature'. In addition, the above exploration only looked at those two variables. Other variables that could be interesting to review in connection with attendance in future iterations include:

  • Raptors win/loss record at the time of each game
  • Raptors opponenets
  • How many runs they are averaging in each game...do people just want to see high scoring games?
  • How different marketing/advertising efforts impact attendance